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Abstract 

This paper presents a parallel algorithm for the 
solution of the generalized eigenproblem in lin- 
ear elastic finite element analysis, [/Y][$] = 
[M][$][Q], where: [ K ] and [M] are of order N y 
and [Q] is of order q. The parallel algorithm 
is based on a completely connected parallel ar- 
chitecture in which each processor is allowed to 
communicate with all other processors. The al- 
gorithm has been successfully implemented on 
a tightly coupled multiple-instruction-multiple- 
data (MIMD) parallel processing computer, Cray 
X-MP. A finite element model is divided into m 
domains each of which is assumed to process n 
elements. Each domain is then assigned to a pro- 
cessor, or to a logical processor (task) if the num- 
ber of domains exceeds the number of physical 
processors. The macrotasking library routines 
are used in mapping each domain to a user task. 
Computational speed-up and efficiency are used 
to determine the effectiveness of the algorithm. 
The effect of the number of domains, the number 
of degrees-of-freedom located along the global 
fronts and the dimension of the subspace on the 
performance of the algorithm are investigated. 
For a 64-element rectangular plate, speed-ups 
of 1.86, 3.13, 3.18 and 3.61 are achieved on two, 
four, six and eight processors, respectively. 

Nomenclature 

[B]i right-hand side at the I th iteration 

= [M][V] 

assembled global front right-hand 
sides 

[i< ) stiffness matrix of order N.N 
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V<Y d 

domain subspace stiffness matrix 

[A? 

subspace stiffness matrix of order 
q.q 

[M] 

mass matrix of order N.N 

W)* d 

domain subspace mass matrix 

[MY 

subspace mass matrix of order q.q 

N 

total number of degrees-of- 
freedom of system 

9 

dimension of the subspace < N 

IQ] 

eigenvectors of the auxiliary eigen- 
problem 

[Ui+i 

eigenvectors obtained at the I th it- 
eration 

[V'Jff 

unknown variables on global front 

A 

over-relaxation factor 

[«] 

eigenvalues of required subspace of 
order q.q 

m 

eigenvectors of required subspace 
of order N.q 


Introduction 


Large finite element models used in the analysis 
and design of complex structures are not uncom- 
mon and usually require enormous amounts of 
computing time to solve the generalized eigen- 
problem. As a result, the capabilities of sequen- 
tial computers are quickly reaching their ultimate 
peaks and efficient parallel algorithms must be 
investigated to meet these computing needs. 

A major advancement in computer hard- 
ware based upon the unique architecture of par- 
allel processing has the potential to decrease exe- 
cution time by several orders of magnitude. How- 
ever in order to successfully improve the perfor- 
mance, one must select and develop numerical 
processes that take advantage of the parallel ar- 
chitecture of this new generation of computers. 
This paper presents the development, description 
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Figure 1: Finite Element Model Divided into m Domains 


and results of a new parallel numerical algorithm 
using a multi-frontal subspace method to deter- 
mine the natural frequencies and mode shapes of 
large structures. 

Numerical Techniques 

The generalized eigenproblem for a structural 
system is defined as: 

mm = [m»] (i) 

in which [A] and [M] are symmetric, [$] is a 
modal matrix where each column is an eigen- 
vector and [Q] is a diagonal matrix containing 
eigenvalues. For the generalized eigenproblem 
described in Eq. 1, the eigenvalues are real and 
positive and the eigenvectors are orthogonal with 
respect to [A'] and [ M ]. The three most time con- 
suming procedures in the solution of large eigen- 
problems are the creation of element stiffness and 
mass matrices, the solution of linear simultaneous 
equations and the extraction of eigenpairs. The 
efficiency and robustness of the frontal method 1 
for the solution of linear simultaneous equations 
and the modified subspace method 2 for the ex- 
traction of the least dominant eigenpairs, have 
prompted the authors to incorporate them in the 
concurrent solution of large eigenproblems. 

The classical subspace method is reported 
to provide an efficient algorithm for the solu- 
tion of large problems in parallel and sequen- 
tial processing 3 - 4 . The rate of convergence of 
the modified subspace method used in this pa- 
per is faster by an average of 33% compared to 


the classical subspace method 2 . There are cer- 
tain advantages in using the multi-frontal solu- 
tion method in parallel processing 5 . First, since 
the bandwidth in the frontal solution depends 
on the numbering of elements, there is no need 
to renumber the nodes within each domain to 
minimize the bandwidth of the submatrices of 
the domain. In addition, the element numbering 
scheme for both sequential and parallel solutions 
may be left unchanged, thereby forgoing prepro- 
cessing of the finite element for parallel execu- 
tion. Second, load balancing is dependent on the 
frontwidth and the number of elements in each 
domain. Load balancing is therefore relatively 
easier to achieve using the multi-frontal solution 
method. 

Parallel Architecture 

A finite element model is divided into m domains 
each of which consists of n elements (Fig. 1). 
Each domain is then assigned to a physical pro- 
cessor, or to a logical processor (task) if the num- 
ber of domains exceeds the number of physical 
processors. The macrotasking library routines 6 
are used in mapping each domain to a user task. 
The parallel algorithm is based on a completely 
connected parallel architecture (Fig. 2) in which 
each processor is allowed to communicate with 
all other processors. 

Parallel Algorithm 

Fig. 3 shows the logical structure of the paral- 
lel algorithm. Each processor creates the stiff- 
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Figure 2: Network of Completely Connected 

m-Concurrent Processors (Tasks) 


ness and mass matrices, [K] e and [M]% of the 
elements located within its assigned domain, 
and calculate the corresponding right-hand sides, 
[B]f. Random numbers are used to generate 
N.q starting eigenvectors [V] e x . Each processor 
then assembles the element matrices, [A] e and 
[B]f, and simultaneously eliminates the equa- 
tions corresponding to the degrees-of-freedom 
not located along the global fronts (boundaries): 

MWJ i = [BYl (2) 

where: [K]\[V]*\. X and [fl]} are the stiffness 

matrix, approximate eigenvectors and the corre- 
sponding right-hand sides, respectively, of the i th 
domain just after the assembly of matrices and 
before the elimination of the degrees-of-freedom 
during the I th iteration. Although Eq. 2 is never 
formed in the frontal solution, it is given here to 
illustrate the algorithm in a more concise man- 
ner. At the conclusion of the assembly and elim- 
ination steps, two matrix equations are obtained 
for the i th domain in which the subscript 
refers to the degrees-of-freedom located along 
the global fronts and the subscript ‘<f’ refers to 
all other degrees-of-freedom within the domain 


(Fig. 1): 

[uuvr d + [i<uvy d = [B] d (3) 

[K]f[VY f = [B]f (4) 

A synchronization point is established at 
this stage in which each processor waits for all 
other processors to calculate and communicate 
[K]f and [B]r, and to assemble the global front 
matrices [A]ff and [7?]ff. The solution for 
the degrees-of-freedom located along the global 
fronts, [K]ff> is obtained and the process of 
back-substitution within each domain proceeds 
concurrently until [V’]*_£ 1 is calculated at the I th 
iteration for each element. Concurrent process- 
ing continues to calculate the projection of the 
stiffness and mass matrices onto the required 
subspace, [A'j^* and [M]* d l of order q.q for the 
i th domain. This is the second and last syn- 
chronization point in the parallel algorithm at 
which the contribution from all other domains 
are required before proceeding to solve the aux- 
iliary eigenproblem of the modified subspace 7,8 , 
[K]'[Q] — [A/]*[Q][ft]. The selection of the fac- 
tor Pi is documented 2,8 and will not be repeated 
here. More accurate approximation of the eigen- 
vectors [V]f is obtained using: 

[V)hi = Wi[Q] (5) 

The algorithm either terminates or continues to 
iterate until a test of convergence is satisfied. In 
this paper a tolerance level of 10" 7 is imposed on 
the highest order eigenvalue in the subspace, w 2 . 

Analysis of Performance 

To measure the success of the parallel algorithm 
on the Cray X-MP/24 supercomputer, two mea- 
sures axe used: 

Speed-up = SP = (> 1) (6) 

1 v 

CP 

Efficiency = (< 100%) (7) 

m 

where: T $ is the time of sequential algorithm. 

T p is the time of parallel algorithm. 
m is the number of processors used in 
the parallel solution. 
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Figure 3: Parallel Algorithm for the i th Domain 
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Figure 4: 64-Element Plate, All Edges Clamped. 

These two measures are computed for a rect- 
angular plate with 64 identical elements (Fig. 
4). The plate is modelled using an isoparametric 
square plate element of length 2.0 m; the element 
consists of four corner nodes and four mid-side 
nodes amounting to 16 degrees-of-freedom per 
element. The plate properties are: Young’s mod- 
ulus is 1.0 psi, Poisson’s ratio is 0.3, the mass 
density is 1.0 lb sec 2 /m 4 and 1.0 in equaling 
the thickness. The algorithm developed is tested 
against a similar sequential algorithm: FEDA 9 . 
Table 1 shows the first eight eigenvalues for the 
plate. 


A number of test runs were analyzed to ex- 
amine the variables influencing the speed-up and 
efficiency of the algorithm in a dedicated mode. 
The following sections present the results ob- 
tained from a number of example problems for 
the 64-element rectangular plate with all edges 
clamped. 

Number of Domains 

Ideal speed-up for the parallel algorithm should 
be equivalent to the number of domains the finite 
element model has been subdivided into. The 
rectangular plate shown in Fig. 4 is tested to 
determine the speed-up and efficiency on two, 
four, six and eight processors. The decoupled 
plates are shown in Fig. 5 to describe the global 
fronts and element numbering layout. Favorable 
results were obtained on the two and four proces- 
sor models (Table 2). The six and eight processor 
models showed performance degradation due to 
the relatively high number of degrees-of-freedom 
on the global front. 


Table 1: Predicted Eigenvalues 


Order 

Parallel and 

of 

Sequential Solution 

Eigenvalue 

(Tol=10 -7 ) 

1 


2 


3 


4 

0.2017x10"’ 

5 

0.2731x10"’ 

6 

0.3814x10"’ 

7 

0.5401x10-’ 

8 

0.7662x10"’ 


Table 2: Performance of Various Domains with q — 2 


No. of 
Processors 

Figure 

No. 

Number of 
Iterations 

Speed-up 

(Tol=10" 7 ) 

Efficiency 

1 

4 

16 

1.00 

100% 

2 

5-a 

16 

1.86 

93% 

4 

5-b 

14 

3.13 

78% 

6 


16 

3.18 

53% 

8 

5-d 

14 

3.61 

45% 
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c: Six-Domain Configuration 



Figure 5: 64-Element Plate Configurations 
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Table 3: Performance of Two Subspace Iterations 


No. of 

Figure 

Speed-up 

Efficiency 

Processors 

No. 

<1 = 

2 

1 

4 

1.00 

100% 

2 

5-a 

1.68 

84% 

4 

5-b 

2.84 

71% 

6 

5-c 

2.88 

48% 

8 

5-d 

2.83 

35% 


The two-processor model (Fig. 5a) per- 
formed very well with a speed-up of 1.86. When 
comparing the results of Tables 2 and 3, the two- 
domain model gains momentum as the number 
of iterations increases. The global front has only 
13 degrees-of-freedom which is the lowest pos- 
sible number for the two domain model. This 
is the only system where an accurate evaluation 
of the communication links could be verified. It 
was found that transmitting information from 
one processor to another through common blocks 
accumulated minimal overhead. 

Table 2 also shows that the results for the 
four-domain model in Fig. 5b benefited greatly 
from a lower number of iterations in the paral- 
lel solution. The number of iterations taken to 
achieve tolerance plays a big part in determining 
the overall speed-up of the algorithm. Difference 
in roundoff error between the sequential and par- 
allel solutions is suspected to be the reason for 
the different number of iterations. 

Subroutine Evaluation 

To initially start the multitasking package a main 
program was developed to set the synchroniza- 
tion points and map out all domain processors. 
The time taken to perform this task was calcu- 
lated to be between 20 to 60 milliseconds, which 
does not have a big impact on the total execution 
time. At the first of three synchronization points 
all input data is read into task one and passed to 
the other tasks to avoid overhead due to single- 


threaded I/O on the Cray computer. This causes 
a 1.0 second delay until all processors can move 
forward again. A description of the subroutines 
used by parallel FEDA (pFEDA) is located in 
Appendix A. 

In DMATRON, the subroutine that deter- 
mines the first and last appearance of all nodes 
in its domain has a very low speed-up for all sizes 
of domains. This subroutine takes about 2% of 
the total execution time to complete. Some over- 
head is accumulated in this subroutine but is not 
critical to the total execution time. The creation 
of element matrices, [A'] e and [A/] e , is the first 
place where significant speed-up is achieved be- 
cause the finite element model is substructured 
into an equal number of elements in each task; 
the individual tasks should have an ideal speed- 
up of 2.0, 4.0, 6.0 and 8.0 for two, four, six and 
eight processors in subroutine ESTIFF. Referring 
to Table 4, the two, four and eight-domain struc- 
tures show efficiencies of 89%, 90% and 90% for 
subroutine ESTIFF which means some overhead 
has been compiled at this point due to paral- 
lel processing. In the unbalanced six-processor 
model (Fig. 5c) the efficiency is only 80% as a 
result of the extra elements in domains one and 
six. 


Table 4: Subroutine Speed-ups for q = 6 


Subroutine 

Number of Processors 

Two 

Four 

Six 

Eight 

DMATRON 

1.27 

1.96 

2.13 

2.40 

ESTIFF 

1.78 

3.59 

4.78 

7.17 

First Iteration 

DFRONT 

1.60 

1.96 

1.21 


DCONDS 

1.86 

3.63 

4.75 


RELOAD 

1.92 

3.81 

5.08 

7.65 

Second Iteration 

DFRONT 

1.83 

2.55 

2.37 

2.34 

DCONDS 

1.86 

3.64 

4.81 


RELOAD 

1.92 

3.84 

5.06 
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After the element matrices have been gener- 
ated, the program is ready to begin the solution- 
resolution process to determine the natural fre- 
quencies of the system. The most critical subrou- 
tine is DFRONT where the multi-frontal tech- 
nique is implemented along with the assembly 
and elimination of the global front. In addi- 
tion, the second synchronization point is located 
within this subroutine to send/receive data on 
the interface matrices (Fig. 3). 

Number of Degrees-of- Freedom along 
Global Fronts: 

Success of the parallel algorithm is depen- 
dent upon the number of degrees-of-freedom on 
the global front; as the number of domains in- 
crease so does the number of degrees-of-freedom 
on the global front. Subroutine DFRONT be- 
haves progressively worse as the number of 
degrees-of-freedom on the global front and do- 
mains increase which can be seen in Table 4. For 
example, in the two-domain problem (Fig. 5) 
with 13 degrees-of-freedom on the global front 
and 115 degrees-of-freedom remaining in each 
domain, subroutine DFRONT in the first and 
second iteration take up 19% of the total execu- 
tion time. In contrast, the eight-domain problem 
with 91 degrees-of-freedom on the global front 
and 19 degrees-of-freedom remaining in each do- 
main takes 64% of the total execution time. The 
execution time of DFRONT in the first iteration 
is always greater than that of the subsequent sub- 
space iterations because of a lower number of cal- 
culations that are required in subsequent itera- 
tions. 

The remaining two subroutines DCONDS 
and RELOAD perform the calculations of the 
modified subspace method. Some overhead is as- 
sociated with these subroutines but overall their 
speed-ups are consistent and performed very 
well. Imbedded in DCONDS is the final synchro- 
nization point to send/receive all domain sub- 
space matrices. 

In summary, the sources of overhead associ- 
ated with pFEDA are: 


1. The extra coding to implement parallel pro- 
cessing. 


2. Input of data and the map of the pre-front 
needed in the solution. 

3. Communication links used to pass data. 

4. Assembly and elimination of the global front 
performed within each task. 

5. Solution of the auxiliary eigenproblem. 

D imension of the Sub space 

The number of eigenvalues and mode shapes is 
increased to determine its impact on the algo- 
rithm. All factors are kept constant when using 
the 64-element plate shown in Figs. 4 and 5 with 
test runs limited to two subspace iterations. Dis- 
played in Fig. 6 is the speed-up relative to the 
increasing number of eigenvalues (q — 2, 4, 6, 8 
and 10). It shows a steady increase in the speed- 
up from 1.68 to 1.75 for the largest two subspace 
dimensions even though the auxiliary eigenprob- 
lem is solved sequentially. This increase in over- 
all speed-up is the result of higher speed-ups at- 
tained by subroutines DFRONT and RELOAD 
which outweigh the lower speed-up of subrou- 
tine DCONDS where the auxiliary eigenproblem 
is solved. In conclusion, for larger finite element 
problems increasing the subspace size adds no 
extra overhead and shows a steady increase in 
speed-up. 

Conclusions 

The parallel program described in this paper 
was found to be an accurate and effective algo- 
rithm to solve linear finite element eigenproblems 
on the Cray X-MP computer and demonstrated 
that speed-ups in execution time can be achieved 
when compared to a similar sequential algorithm 
(Fig. 7). In the course of this research, the fol- 
lowing conclusions have emerged: 

1. pFEDA takes advantage of the shared and 
“private” memory on the MIMD Cray com- 
puter while successfully using a completely 
connected architecture to transmit informa- 
tion from one processor to another. 
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Figure 6: Impact of Increasing the Subspace 

Dimension 


2 . Communication links appear to be optimally 
synchronized. Overhead due to data con- 
tention and library routines calls are found 
to be of minimal impact on the performance. 

3. Performance for the creation of the stiffness 
and mass matrices and the modified sub- 
space method were extremely encouraging 
and indicate the effectiveness of multitasking 
environment on the Cray X-MP computer. 

4. The major deficiency of pFEDA was the 
elimination of the degrees-of-freedom on the 
global front, as the domains increased so 
did the degrees-of-freedom on the bound- 
ary. The extra sequential calculations per- 
formed by each task to handle the global 
front lowered the speed-up and efficiency 
significantly. 

5. When subdividing a finite element model 
into m domains one should choose the con- 
figuration with the lowest possible degrees- 
of-freedom on the global front for this will 
increase speed-up and efficiency of the par- 
allel solution. 



Figure 7: Effectiveness of pFEDA, q = 6 
and Six Subspace Iterations 


6. Increasing the size of the subspace creates 
no extra overhead even though the auxiliary 
eigenproblem is solved sequentially. 

7. Load balancing, i.e. assigning an equivalent 
amount of work to each task by keeping the 
number of elements and frontwidth equal in 
all domains, is very important in the perfor- 
mance of pFEDA. 

8. Element numbering is an important aspect 
in lowering the frontwidth for the frontal 
technique. 
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1. Subroutine DMATRON: 

The data is checked for fatal or non-fatal 
errors. This routine then determines the last 
appearance of each node (pre-front). The 
size of the global front is also calculated. 

2. Subroutine ESTIFF: 

Creates the element stiffness and mass ma- 
trices. It also creates the element initial 
eigenvectors. 

3. Subroutine DFRONT: 

Assembles and eliminates the stiffness ma- 
trix for the degrees-of-freedom in their last 
appearance up to the global front. Af- 
ter the global front is reached and boundry 
interface matrices are transmitted to all 
tasks, the global front can be assembled and 
eliminated. Immediately afterward back- 
substitution begins to calculate the unknown 
variables within each domain. 

4. Subroutine DCONDS: 

Calculates the projection of [A'] and [A/] 
onto the current subspace for each itera- 
tion and communicates them to all other 
user tasks. After this is completed , the 
auxiliary eigenproblem is solved. Next, a 
better AZ-orthonormalized approximation of 
the required eigenvectors is constructed. 

5. Subroutine RELOAD: 

After each iteration, a set of new fictitious 
inertia loads is calculated for each element to 
be used in the new subspace iteration step. 
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